Predictive Modeling of Weather Station Data:
Linear Regression vs. Graph Neural Network
Slides: slides.html
Introduction
Accurate weather prediction is a crucial task with widespread implications across agriculture, transportation, disaster preparedness, and energy management. Traditional forecasting methods often rely on statistical models or physics-based simulations; however, with the advancement of graph neural networks (GNNs), we believe a more modern deep learning approach holds potential.
In this project, we explore the predictive power of a traditional linear regression model and a GNN on real-world weather station data. Our aim is to evaluate whether the GNN’s ability to incorporate spatial relationships between stations offers a measurable advantage over more conventional techniques.
The dataset consists of multiple weather stations located within the same geographic region. Each station collects meteorological variables over time, and can be represented as a node within a broader spatial network. For the linear model baseline, a single model will be trained using all stations’ data aggregated per feature for each time step.
For the GNN the model will be trained on the entire network of stations, where each node corresponds to a station and edges represent spatial relationships. The graph is encoded via a dense adjacency matrix, excluding self-connections. The GNN aims to leverage the inherent spatial structure of the data, potentially capturing regional weather patterns and inter-station dependencies that are invisible to traditional models.
Our evaluation focuses on forecasting performance over the final six months of the dataset. We assess how well each modeling approach predicts key weather variables and investigate the conditions under which one model may outperform the other.
Methods
This section outlines the modeling approaches, data structure, and training procedures used to compare the traditional linear model and the GNN on weather station data.
1. Data Structure
The base dataset consists of 33 features over 8 stations, with 96,408 hourly time steps and intermittent sections of missing data. The key features of interest, with descriptions from (Herzmann 2023), are:
| Feature | Description |
|---|---|
| station | Station identifier code (3/4 characters) |
| valid | Timestamp of the observation |
| lon | Longitude of station |
| lat | Latitude of station |
| elevation | Elevation of station |
| tmpf | Air temperature in Fahrenheit |
| relh | Relative humidity in percent |
| drct | Wind direction in degrees |
| sknt | Wind speed in knots |
| p01i | One hour precipitation for the period from the observation time to the time of the previous hourly precipitation reset |
| vsby | Visibility in miles |
While the dataset includes many other features, the remainder are of little interest for the purposes of this project.
2. Cleaning Process
The cleaning process for the data was a multistage procedure:

- Fill in all missing time steps for all stations between the dataset's minimum and maximum timestamps
- Compress the time steps from 1-hour intervals down to 6-hour intervals
- Remove any feature missing more than 10% of its values in more than 6 of the 8 stations
- Remove any station missing more than 2 years of valid data (some stations are newer than others)
- Perform correlation analysis on each feature, both between nodes and within nodes
- Scale the remaining features with appropriate scalers
- Transform the remaining dataframe into an array of shape (time, station, feature)
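The final reshaping step could be sketched as follows, using a hypothetical long-format frame with two stations and two features (column names mirror the dataset; the values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical long-format frame: one row per (timestamp, station) pair.
df = pd.DataFrame({
    "valid":   pd.to_datetime(["2018-01-01 00:00"] * 2 + ["2018-01-01 06:00"] * 2),
    "station": ["GCK", "LBL", "GCK", "LBL"],
    "tmpf":    [9.28, 12.32, 10.1, 13.0],
    "relh":    [48.1, 52.0, 50.2, 51.5],
})

features = ["tmpf", "relh"]
stations = sorted(df["station"].unique())
times = sorted(df["valid"].unique())

# Pivot each feature to a (time, station) grid, then stack the grids along a
# trailing axis, producing the (time, station, feature) array the models consume.
arr = np.stack(
    [df.pivot(index="valid", columns="station", values=f)
       .reindex(index=times, columns=stations).to_numpy()
     for f in features],
    axis=-1,
)
print(arr.shape)  # (2, 2, 2)
```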
3. Linear Model
The linear model is formulated as a time-series regression task. It uses the feature information from the previous twenty-eight time steps (seven days) to predict the feature values at the next time step. Each input consists of an aggregation of each node's features for the five meteorological features, resulting in a fixed-length input vector per prediction target.
\(\hat{y}_{t+1}=f(\text{features}_t,\ \text{features}_{t-1},\ \dots,\ \text{features}_{t-27})\)
\(\text{features}_t=(\text{tmpf},\ \text{relh},\ \text{sknt},\ \text{drct}_{\sin},\ \text{drct}_{\cos})\)
The five input features are:
- Temperature
- Relative Humidity
- Wind Speed
- Wind Direction (represented as sin and cosine components)
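Assuming the cleaned (time, station, feature) array described in the cleaning steps, a sketch of how the lagged design matrix might be assembled (the array here is a random stand-in with the project's 7 stations and 5 features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 7, 5))  # toy (time, station, feature) stand-in

LAGS = 28  # seven days of 6-hour steps

# Each sample flattens the previous 28 steps of every station's features;
# the target here is one station's temperature at the next step.
X = np.stack([data[t - LAGS:t].reshape(-1) for t in range(LAGS, len(data))])
y = data[LAGS:, 0, 0]  # e.g. tmpf (feature 0) at station 0

print(X.shape, y.shape)  # (72, 980) (72,)
model = LinearRegression().fit(X, y)
```

The 980-dimensional input is simply 28 lags × 7 stations × 5 features; each station/feature target would get its own fit in this framing.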
4. GNN
The GNN is designed to capture spatiotemporal dependencies in the weather station network. It is implemented using PyTorch and follows a structure inspired by the Diffusion Convolutional Recurrent Neural Network (DCRNN) architecture.
- Architecture
  - Input Format:
    - Data is structured using the StaticGraphTemporalSignal format
    - Each node represents a weather station, and temporal sequences of node features are used for prediction
  - Data Split:
    - Testing = last 6 months (730 six-hour time steps)
    - Training = head 80% of the remaining data
    - Validation = tail 20% of the remaining data
  - Layers:
    - A DCRNN layer to capture spatial and temporal dependencies
    - A ReLU activation function
    - A Linear output layer for the final prediction
    - Full structure: DCRNN(140, 64) -> ReLU -> DCRNN(64, 32) -> ReLU -> DCRNN(32, 32) -> ReLU -> Linear(32, 1)
- Training Configuration
  - Optimizer:
    - Adam
  - Learning Rate:
    - Base learning rate of 0.01, reduced by a factor of 0.1 on plateau down to a floor of 1e-5
  - Epochs:
    - Trained for a maximum of 100 epochs with an early-exit-on-plateau callback
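The layer stack above can be sketched in plain PyTorch. The project uses the DCRNN layer from PyTorch Geometric Temporal; the cell below is a simplified one-hop stand-in for illustration only. Note how concatenating a 140-dim input with a 64-dim hidden state explains the DConv(204, 64) shapes in the model summary later in the report:

```python
import torch
import torch.nn as nn

class DiffusionGRUCell(nn.Module):
    """Simplified stand-in for a DCRNN layer: a GRU cell whose gate transforms
    first diffuse node features over the graph via the adjacency matrix."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        # Input and hidden state are concatenated before each gate,
        # so a 140-input / 64-hidden layer effectively maps 204 -> 64.
        self.z = nn.Linear(in_dim + hid_dim, hid_dim)
        self.r = nn.Linear(in_dim + hid_dim, hid_dim)
        self.h = nn.Linear(in_dim + hid_dim, hid_dim)

    def forward(self, x, h, A):
        # x: (nodes, in_dim), h: (nodes, hid_dim), A: dense (nodes, nodes)
        zr_in = A @ torch.cat([x, h], dim=1)   # one-hop diffusion over the graph
        z = torch.sigmoid(self.z(zr_in))       # update gate
        r = torch.sigmoid(self.r(zr_in))       # reset gate
        h_in = A @ torch.cat([x, r * h], dim=1)
        return (1 - z) * h + z * torch.tanh(self.h(h_in))

class RecurrentGCN(nn.Module):
    """DCRNN(140, 64) -> ReLU -> DCRNN(64, 32) -> ReLU -> DCRNN(32, 32)
    -> ReLU -> Linear(32, 1), mirroring the structure listed above."""
    def __init__(self):
        super().__init__()
        self.cells = nn.ModuleList([
            DiffusionGRUCell(140, 64),
            DiffusionGRUCell(64, 32),
            DiffusionGRUCell(32, 32),
        ])
        self.linear = nn.Linear(32, 1)

    def forward(self, x, A):
        for cell, hid in zip(self.cells, (64, 32, 32)):
            h = torch.zeros(x.shape[0], hid)   # fresh hidden state per layer
            x = torch.relu(cell(x, h, A))
        return self.linear(x)

nodes = 7
x = torch.randn(nodes, 140)                    # 28 lags x 5 features per node
A = torch.rand(nodes, nodes) * (1 - torch.eye(nodes))  # dense, no self-loops
out = RecurrentGCN()(x, A)
print(out.shape)  # torch.Size([7, 1])
```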
The model is trained to predict the temperature for the next time step based on the preceding twenty-eight time steps, analogous to the linear model.
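A minimal sketch of the training configuration, using a stand-in `nn.Linear` model on random data (the real loop iterates over the StaticGraphTemporalSignal splits); the patience values are assumptions, since the report only specifies "on plateau":

```python
import torch
import torch.nn as nn

# Toy stand-ins for model and data.
model = nn.Linear(140, 1)
X_train, y_train = torch.randn(64, 140), torch.randn(64, 1)
X_val, y_val = torch.randn(16, 140), torch.randn(16, 1)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Reduce the LR by 10x when validation loss plateaus, down to a floor of 1e-5.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.1, patience=5, min_lr=1e-5)

best_val, patience_left = float("inf"), 15  # early-exit-on-plateau budget
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    scheduler.step(val_loss)

    if val_loss < best_val - 1e-6:
        best_val, patience_left = val_loss, 15
    else:
        patience_left -= 1
        if patience_left == 0:
            break  # validation loss has plateaued; stop early
```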
Analysis and Results
Data Exploration and Visualization
1. Source:
The data used in this project is sourced from the Iowa Environmental Mesonet at Iowa State University through (Herzmann 2023), where weather station data is aggregated from various sources based on the standards shown in (Administration 2021). For our specific use case we selected data over a ten-year period from 2010 to 2020 for nine Kansas nodes, chosen primarily for their proximity and location.
| station | valid | lat | lon | elevation | tmpf | dwpf | relh | drct | sknt | p01i | alti | mslp | vsby | gust | skyc1 | skyc2 | skyc3 | skyc4 | skyl1 | skyl2 | skyl3 | skyl4 | wxcodes | ice_accretion_1hr | ice_accretion_3hr | ice_accretion_6hr | peak_wind_gust | peak_wind_drct | peak_wind_time | feel | snowdepth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cat | datetime[μs] | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | cat | cat | cat | cat | f32 | f32 | f32 | f32 | str | f32 | f32 | f32 | f32 | f32 | datetime[μs] | f32 | f32 |
| "GCK" | 2020-12-08 12:00:00 | 37.927502 | -100.724403 | 881.0 | 25.533333 | 14.0 | 61.435001 | 293.333344 | 8.666667 | 0.0 | 30.24 | 1025.150024 | 10.0 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 15.738334 | null |
| "EHA" | 2013-02-09 12:00:00 | 37.000801 | -101.879997 | 1099.0 | 32.833332 | 11.571111 | 41.004444 | 166.111115 | 12.055555 | 0.0 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 23.355556 | null |
| "JHN" | 2013-05-06 00:00:00 | 37.578201 | -101.7304 | 1012.710022 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
| "19S" | 2016-06-28 00:00:00 | 37.496899 | -100.832901 | 892.570007 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
| "LBL" | 2011-08-07 18:00:00 | 37.044201 | -100.9599 | 879.0 | 93.099998 | 61.099998 | 35.308334 | 209.444443 | 7.555555 | null | 29.940001 | null | 10.0 | 14.8 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 93.669998 | null |
2. EDA
Once the nine nodes were selected, further analysis was required both to reduce the volume of data and to ensure its quality, as the source is explicit that the provided data can be of poor quality.
This reduction was done by first finding which features were missing no more than 10% of their values, and then finding which nodes were missing no more than 10% of their values over a given year range. During this step one node was dropped because it was introduced after 2020 and thus had no valid data for the selected date range.
Based on this visualization, a date range of 2018 to 2020 was selected, as it had the most valid features and stations while also being quite recent. With this range selected, the ULS station was dropped, leaving seven valid stations.
| station | valid | lat | lon | elevation | tmpf | dwpf | relh | sknt | feel | drct_sin | drct_cos |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cat | datetime[μs] | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f64 | f64 |
| "GCK" | 2018-01-01 00:00:00 | 37.927502 | -100.724403 | 881.0 | 9.283334 | -6.5 | 48.148335 | 8.666667 | -4.161667 | 0.851117 | 0.524977 |
| "LBL" | 2018-01-01 00:00:00 | 37.044201 | -100.9599 | 879.0 | 12.316667 | -1.983333 | 52.014999 | 10.166667 | -1.721667 | 0.664796 | 0.747025 |
| "EHA" | 2018-01-01 00:00:00 | 37.000801 | -101.879997 | 1099.0 | 15.555555 | 5.255556 | 63.382778 | 7.388889 | 4.482222 | 0.970763 | 0.24004 |
| "HQG" | 2018-01-01 00:00:00 | 37.163101 | -101.370499 | 956.52002 | 14.311111 | -1.605556 | 48.468334 | 7.777778 | 2.681667 | 0.936332 | 0.351115 |
| "3K3" | 2018-01-01 00:00:00 | 37.991699 | -101.7463 | 1005.700012 | 13.1 | -0.9 | 53.127777 | 6.777778 | 2.151111 | 0.981255 | 0.192712 |
| … | … | … | … | … | … | … | … | … | … | … | … |
| "EHA" | 2020-12-31 00:00:00 | 37.000801 | -101.879997 | 1099.0 | 42.355556 | 16.588888 | 34.955555 | 3.0 | 40.254444 | -0.533205 | -0.845986 |
| "HQG" | 2020-12-31 00:00:00 | 37.163101 | -101.370499 | 956.52002 | 40.722221 | 13.255555 | 32.23111 | 1.666667 | 39.450001 | 0.977334 | -0.211704 |
| "3K3" | 2020-12-31 00:00:00 | 37.991699 | -101.7463 | 1005.700012 | 40.200001 | 12.2 | 31.377777 | 4.555555 | 36.683334 | -0.700217 | -0.71393 |
| "JHN" | 2020-12-31 00:00:00 | 37.578201 | -101.7304 | 1012.710022 | 40.711113 | 17.4 | 38.554443 | 4.777778 | 37.06889 | -0.824675 | -0.565607 |
| "19S" | 2020-12-31 00:00:00 | 37.496899 | -100.832901 | 892.570007 | 39.922222 | 15.388889 | 36.552223 | 6.111111 | 35.113335 | -0.824675 | 0.565607 |
3. Graph Creation
Once the valid date range was selected and the features were reduced to only those that could be imputed, the graph needed to be generated.
As the number of nodes was low, it made sense to generate a dense graph without self-loops. This was simply done by looping over every node pair to produce an adjacency matrix, which was then weighted by the inverse distance and min-max scaled to ensure the values all ranged from 0 to 1.
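A vectorized sketch of this adjacency construction, using hypothetical station coordinates (real code would use the dataset's lon/lat columns):

```python
import numpy as np

# Hypothetical station coordinates (lon, lat).
coords = np.array([
    [-100.72, 37.93], [-100.96, 37.04], [-101.88, 37.00],
    [-101.37, 37.16], [-101.75, 37.99],
])

# Pairwise Euclidean distance between every pair of stations.
diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(-1))

# Inverse distance with no self-loops, then min-max scaled to [0, 1]
# over the off-diagonal entries.
n = len(coords)
with np.errstate(divide="ignore"):
    inv = np.where(np.eye(n, dtype=bool), 0.0, 1.0 / dist)
off = inv[~np.eye(n, dtype=bool)]
A = np.where(inv > 0, (inv - off.min()) / (off.max() - off.min()), 0.0)
print(A.shape)  # (5, 5)
```

One consequence of min-max scaling is that the farthest pair of stations receives weight 0, effectively dropping that edge.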
4. Graph Imputation
Once the graph was created, the data needed to be imputed both spatially and temporally. This was done through a two-step process: the data was first imputed across each time slice using only spatial information, and then a second pass filled remaining gaps using only temporal information. While this process is not perfect, it accurately imputed the spatiotemporal graph structure with no apparent issues, as shown below.
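The two-pass idea can be sketched on a toy (time, station) slice of one feature; the values and adjacency weights here are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy (time, station) slice of one feature with gaps, plus a toy
# row-normalised adjacency between the three stations.
vals = np.array([[10.0, np.nan, 14.0],
                 [np.nan, 12.0, np.nan],
                 [11.0, 13.0, 15.0]])
A = np.array([[0.0, 0.6, 0.4],
              [0.5, 0.0, 0.5],
              [0.4, 0.6, 0.0]])

# Step 1 (spatial): replace each gap with the adjacency-weighted mean
# of the stations observed at the same time step.
spatial = vals.copy()
for t in range(vals.shape[0]):
    obs = ~np.isnan(vals[t])
    for s in np.flatnonzero(~obs):
        w = A[s, obs]
        if w.sum() > 0:
            spatial[t, s] = (w * vals[t, obs]).sum() / w.sum()

# Step 2 (temporal): linearly interpolate any gaps that remain
# along each station's own time series.
imputed = pd.DataFrame(spatial).interpolate(limit_direction="both").to_numpy()
print(np.isnan(imputed).any())  # False
```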
5. Correlation Analysis
Once the data was imputed and cleaned, a correlation analysis was run to see whether any features should be removed due to strong correlation. This analysis was done both between nodes and within each node. It found that dwpf and feel both had a strong correlation with the target variable tmpf, which makes logical sense, as dew point and feels-like temperature are both computed from temperature. These two features were therefore removed.
Following the within-node correlation, features were compared between nodes, which showed whether it made sense to consider the spatial relationship of the data. As shown below, there is a very strong correlation between all nodes regardless of feature.
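A minimal illustration of such a between-node check on synthetic data, where stations share a regional signal plus small local noise:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy (time, station) matrix for one feature: a shared regional signal
# plus small station-level noise, mimicking nearby weather stations.
shared = rng.normal(size=(500, 1))
temps = shared + 0.1 * rng.normal(size=(500, 7))

# Between-node correlation matrix: one row/column per station.
corr = np.corrcoef(temps.T)
print(corr.shape)  # (7, 7)
```

High off-diagonal values in such a matrix are what justify modeling the stations jointly rather than independently.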
6. Scaling
Following all of the above steps, the final dataset was reduced to 5 features over 7 stations with 4,381 six-hour time steps. This dataset was then scaled using scikit-learn's RobustScaler.
| station | tmpf | relh | sknt | drct_sin | drct_cos |
|---|---|---|---|---|---|
| cat | f64 | f64 | f64 | f64 | f64 |
| "GCK" | -1.355721 | 0.481483 | -0.080357 | 0.851117 | 0.524977 |
| "LBL" | -1.265174 | 0.52015 | 0.160714 | 0.664796 | 0.747025 |
| "EHA" | -1.168491 | 0.633828 | -0.285714 | 0.970763 | 0.24004 |
| "HQG" | -1.205638 | 0.484683 | -0.223214 | 0.936332 | 0.351115 |
| "3K3" | -1.241791 | 0.531278 | -0.383929 | 0.981255 | 0.192712 |
| … | … | … | … | … | … |
| "EHA" | -0.368491 | 0.349556 | -0.991072 | -0.533205 | -0.845986 |
| "HQG" | -0.417247 | 0.322311 | -1.205357 | 0.977334 | -0.211704 |
| "3K3" | -0.432836 | 0.313778 | -0.741072 | -0.700217 | -0.71393 |
| "JHN" | -0.417579 | 0.385544 | -0.705357 | -0.824675 | -0.565607 |
| "19S" | -0.441128 | 0.365522 | -0.491072 | -0.824675 | 0.565607 |
RecurrentGCN(
(recurrent1): DCRNN(
(conv_x_z): DConv(204, 64)
(conv_x_r): DConv(204, 64)
(conv_x_h): DConv(204, 64)
)
(recurrent2): DCRNN(
(conv_x_z): DConv(96, 32)
(conv_x_r): DConv(96, 32)
(conv_x_h): DConv(96, 32)
)
(recurrent3): DCRNN(
(conv_x_z): DConv(64, 32)
(conv_x_r): DConv(64, 32)
(conv_x_h): DConv(64, 32)
)
(linear): Linear(in_features=32, out_features=1, bias=True)
)
GNN test MSE: 0.0562
LinearRegression()
Linear regression test MSE: 0.0147
Figure: LR Actual vs Predicted Over Time for Each Node (panels for stations JHN, GCK, LBL, HQG, 19S, 3K3, and ULS).
Figure: LR Absolute Error for Each Station (panels for stations JHN, GCK, LBL, HQG, 19S, 3K3, and ULS).
Figure: GNN vs. LR Absolute Error for Each Station (panels for stations JHN, GCK, LBL, HQG, 19S, 3K3, and ULS; y-axis limited to 0–500).
Modeling and Results
Both models were evaluated on the held-out final six months of data. The linear regression baseline achieved a test MSE of 0.0147, while the GNN achieved a test MSE of 0.0562, indicating that on this dataset the simpler linear model outperformed the GNN at one-step temperature prediction.
Conclusion
On this weather station network, the linear regression baseline (test MSE 0.0147) outperformed the DCRNN-based GNN (test MSE 0.0562). The very strong between-station correlations observed during EDA suggest the region's weather is highly homogeneous, so the spatial structure the GNN models may add little beyond what the aggregated linear features already capture. Future work could examine larger or more heterogeneous station networks and longer forecast horizons, where spatial modeling may offer a clearer advantage.